Getting Set Up

If you haven’t already, create a folder for this course, and then a subfolder within for the second lecture Topic9_Clustering, and two additional subfolders within code and data.

Open RStudio and create a new RMarkDown file (.Rmd) by going to File -> New File -> R Markdown.... Change the title to "DS1000: Problem Set 7" and the author to your full name. Save this file as [LAST NAME]_ps7.Rmd to your code folder.

If you haven’t already, download the CountyVote2004_2020.Rds file from the course github page and save it to your data folder.

All of the following questions should be answered using

Require tidyverse and load the CountyVote2004_2020.Rds data to dat.

# INSERT CODE HERE
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
dat <- readRDS("../data/CountyVote2004_2020.rds")
glimpse(dat)
## Rows: 4,658
## Columns: 16
## $ state        <chr> "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK…
## $ county.code  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ county.name  <chr> "ED 1", "ED 2", "ED 3", "ED 4", "ED 5", "ED 6", "ED 7", "…
## $ fips.code    <chr> "02901", "02902", "02903", "02904", "02905", "02906", "02…
## $ geo.strata   <chr> "5--Fairbanks/Front", "5--Fairbanks/Front", "5--Fairbanks…
## $ party.strata <chr> "3--Middle", "4--Mod Republican", "5--High Republican", "…
## $ pct_dem_2020 <dbl> 0.4724185, 0.3415030, 0.2259827, 0.5048645, 0.4828854, 0.…
## $ pct_dem_2004 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pct_dem_2008 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pct_dem_2012 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pct_dem_2016 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pct_rep_2020 <dbl> 0.4770380, 0.5963318, 0.7215295, 0.4429961, 0.4682977, 0.…
## $ pct_rep_2004 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pct_rep_2008 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pct_rep_2012 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pct_rep_2016 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

HINT: Questions 1 and 2 can be found in the slides and recording for Monday’s lecture. Questions 3 - 5 + the extra credit can be found in the slides and recording for Wednesday’s lecture. Pay particular attention to the pre-lecture handouts if you get stuck!

Question 1 [4 points]

Describe the columns pct_rep_2020 and pct_rep_2004, following these steps:

  1. Look at the data and identify missingness. Which states have missing values?
  2. Visualize the data using univariate visualization for both measures, but put them on the same plot, differentiated by color. Do you notice any patterns?
  3. Visualize the data using multivariate (or conditional) visualization where pct_rep_2020 is the outcome and pct_rep_2004 is the predictor. Use the geom_abline() to create a 45 degree line on these plots. Do you notice any patterns? EXTRA CREDIT: Interpret this plot substantively and color the points by whether they are above or below the 45 degree line.
# INSERT CODE HERE
dat %>%
  count(is.na(pct_rep_2020))
## # A tibble: 2 × 2
##   `is.na(pct_rep_2020)`     n
##   <lgl>                 <int>
## 1 FALSE                  4623
## 2 TRUE                     35
dat %>%
  count(is.na(pct_rep_2004))
## # A tibble: 2 × 2
##   `is.na(pct_rep_2004)`     n
##   <lgl>                 <int>
## 1 FALSE                  4604
## 2 TRUE                     54
dat %>%
  group_by(state) %>%
  select(pct_rep_2004) %>%
  drop_na() %>%
  filter(is.na(pct_rep_2004))
## Adding missing grouping variables: `state`
## # A tibble: 0 × 2
## # Groups:   state [0]
## # … with 2 variables: state <chr>, pct_rep_2004 <dbl>
dat %>%
  group_by(state) %>%
  select(pct_rep_2020) %>%
  drop_na() %>%
  filter(is.na(pct_rep_2020)) 
## Adding missing grouping variables: `state`
## # A tibble: 0 × 2
## # Groups:   state [0]
## # … with 2 variables: state <chr>, pct_rep_2020 <dbl>
dat %>%
  ggplot() +
  geom_density(aes(x = pct_rep_2004, color = "lightblue")) +
  geom_density(aes(x = pct_rep_2020, color = "darkblue")) +
  scale_color_manual(values = (c("lightblue","darkblue")),
                     labels = (c("2004","2020"))) +
  labs(title = "Density of Republicans in 2004 and 2020",
       subtitle = "Data from CountyVote2004_2020.rds",
       x = "Percentage",
       y = "Density")
## Warning: Removed 54 rows containing non-finite values (stat_density).
## Warning: Removed 35 rows containing non-finite values (stat_density).

dat %>%
  mutate(higher = ifelse(pct_rep_2004 >= pct_rep_2020, pct_rep_2020, NA)) %>%
  mutate(lower = ifelse(pct_rep_2004 < pct_rep_2020, pct_rep_2020, NA)) %>%
  ggplot() +
  geom_point(aes(x = pct_rep_2004, y = higher, color = "red", alpha = 0.6)) +
  geom_point(aes(x = pct_rep_2004, y = lower, color = "blue", alpha = 0.6)) +
  geom_abline(intercept=0,slope=1) + 
  scale_color_manual(values = (c("red","blue")),
                     labels = (c("2004 Percentage Higher than 2020 Percentage","2004 Percentage Lower than 2020 Percentage"))) +
  labs(title = "2004 Percentage of Republican Votes as a Predictor of 2020 Percentage",
       subtitle = "Data from CountyVote2004_2020.rds",
       x = "Percentage in 2004",
       y = "Percentage in 2020")
## Warning: Removed 2897 rows containing missing values (geom_point).
## Warning: Removed 1845 rows containing missing values (geom_point).

  • In pct_rep_2020 there are 35 NA values while in pct_rep_2004 there are 54 NA values. Both Arkansas and Maine have missingness in 2004, while Maine and Virginia have missingness in 2020. Based on the univariate visualization, there is a greater percentage in 2004 than 2020. Similarly, based on the same univariate visualization of both years, there is a greater density in 2020 than 2004. In the graph titled “2004 Percentage of Republican Votes as a Predictor of 2020 Percentage”, red signifies that the 2004 percentage is higher than the 2020 percentage while blue signifies that the 2004 percentage is lower than the 2020 percentage.

Question 2 [4 points]

Perform k-means analysis on these variables with k = 2, and then plot the results, coloring the points by cluster assignment. Then loop over values of k from 1 to 30 and plot the “elbow plot” with k on the x-axis and the total within sum of squares on the y-axis. What value of k would you choose? Re-calculate with that value, and then plot again. NB: set nstart = 25 to ensure replicability! EXTRA CREDIT: Are you able to interpret these groups as a political scientist?

# INSERT CODE HERE
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
datClust <- dat %>% select(pct_rep_2004, pct_rep_2020) %>% drop_na()

clusterVotes1 <- kmeans(datClust %>% select(pct_rep_2004,pct_rep_2020), centers = 2)

ggVotes <- datClust %>%
  mutate(clusterVotes = clusterVotes1$cluster) %>%
  ggplot(aes(x = pct_rep_2004, y = pct_rep_2020, color = factor(clusterVotes))) +
  geom_point() +
  geom_abline(intercept = 0,slope = 1) +
  labs(title="Percentage in 2004 vs Percentage in 2020",
       subtitle = "Data from CountyVote2004_2020.rds",
       x="Percentage in 2004",
       y="Percentage in 2020",
       color = "Cluster")
       ggplotly(ggVotes,tooltip = "text")
totWSS <- NULL
for(k in 1:30) { 
  clusterScale <- datClust %>%
    kmeans(centers = k,nstart = 25, iter.max=30)
    totWSS <- data.frame(totWSS = clusterScale$tot.withinss,k = k) %>%
    bind_rows(totWSS)
}
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 228700)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 228700)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 228700)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 228700)
totWSS %>%
  ggplot(aes(x = k,y = totWSS)) +
  geom_line() +
  labs(title="Total WSS vs Number of Clusters",
       subtitle = "Data from CountyVote2004_2020.rds",
       x = "Number of Clusters",
       y = "Total WSS")

clusterVotes2 <- kmeans(datClust %>% select(pct_rep_2004,pct_rep_2020), centers = 10)
ggVotes <- datClust %>%
  mutate(clusterVotes = clusterVotes2$cluster) %>%
  ggplot(aes(x = pct_rep_2004, y = pct_rep_2020, color = factor(clusterVotes))) +
  geom_point() +
  geom_abline(intercept = 0,slope = 1) +
  labs(title="Percentage in 2004 vs Percentage in 2020",
       subtitle = "Data from CountyVote2004_2020.rds",
       x="Percentage in 2004",
       y="Percentage in 2020")
  ggplotly(ggVotes,tooltip = 'text')
set.seed(123)
clusterScale <- kmeans(datClust, centers = 10)
  • With K = 2, there are two clusters along the line with a slope of 1. However, using the “elbow plot” it is evident that K = 10 makes a lot more sense. 10 makes sense because it is located at the “elbow” Thus a data scientist would conclude that with 10 clusters based on counties it is possible to highlight counties’ percentage in 2004 compared with 2020.

Question 3 [4 points]

Now open the FederalistPaperCorpusTidy.Rds dataset (download from here). Require the tidytext package (install it if you haven’t yet) and tokenize the data via the unnest_tokens() function, stemming the words via the token = "word_stems" input. Remove stop words and then calculate the most frequently used words by author. Plot the top 10 words by author and interpret the results. Do you notice any patterns in how different authors write?

# INSERT CODE HERE
require(tidytext)
## Loading required package: tidytext
dat <- readRDS("../data/FederalistPaperCorpusTidy.rds")

tokenize <- dat %>%
  unnest_tokens (word, text, "word_stems") %>%
  mutate(word = gsub("\\d+", "", word)) %>% 
  filter(word != "")
## Warning: Outer names are only allowed for unnamed scalar atomic inputs
data("stop_words", package = "tidytext")

tokenize <- anti_join(tokenize, stop_words, by = "word")

tokenize <- tokenize %>%
  filter(!word %in% stop_words$word)

tokenize %>%
  group_by(author) %>%
  count(word) %>%
  top_n(10, wt = n) %>%
  arrange(-n) %>%
  ggplot(aes(x=n, y = word, fill = author)) +
  geom_bar(stat = "identity") +
  facet_wrap(~author, scales = "free") +
  theme(legend.position = "top") +
  labs(title = "Top 10 Words by Author",
       subtitle = "Data from FederalistPaperCorpusTidy.rds")

tokenize %>%
  group_by(author) %>%
  count(word) %>%
  arrange(-n)
## # A tibble: 10,450 × 3
## # Groups:   author [5]
##    author   word          n
##    <chr>    <chr>     <int>
##  1 hamilton power       520
##  2 hamilton govern      509
##  3 hamilton nation      366
##  4 hamilton ani         362
##  5 hamilton constitut   359
##  6 madison  govern      304
##  7 madison  power       280
##  8 hamilton author      242
##  9 hamilton union       240
## 10 hamilton law         225
## # … with 10,440 more rows
  • Based on the plots, goverment and power are the two most common words. Interestingly enough, Hamilton uses the word power more than government, something not seen by any of the other authors or combinations of authors. Meanwhile, Jay uses the word nation the most while its use is less common for Hamilton and not even prevalent for Madison.

Question 4 [4 points]

Create an author-term matrix (analogous to a document term matrix except organized by author). Then calculate the TF-IDF by author and plot the top 10 words by TF-IDF for each author. Do you observe any noticeable differences now?

# INSERT CODE HERE
require(tm)
## Loading required package: tm
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
atmatrix <- tokenize %>% count(author, word)

tfidf <- bind_tf_idf(atmatrix, word, author, n)

tfidf %>%
  group_by(author) %>%
  top_n(10, wt = tf_idf) %>%
  ggplot(aes(x = tf_idf, y = reorder(word, tf_idf), fill = author)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 10 Words by Author using TF - IDF",
       subtitle = "Data from FederalistPaperCorpusTidy.rds",
       x = 'TF - IDF',
       y = NULL) +
  facet_wrap(~author,scales = "free",ncol = 2) +
  theme(legend.position = "top")

tfidf %>%
  group_by(author) %>%
  arrange(-tf_idf) %>%
  slice(1:10)
## # A tibble: 50 × 6
## # Groups:   author [5]
##    author    word           n       tf   idf  tf_idf
##    <chr>     <chr>      <int>    <dbl> <dbl>   <dbl>
##  1 contested branch        38 0.00486  0.511 0.00248
##  2 contested elect         78 0.00998  0.223 0.00223
##  3 contested slave         14 0.00179  0.916 0.00164
##  4 contested object        50 0.00640  0.223 0.00143
##  5 contested rule          18 0.00230  0.511 0.00118
##  6 contested biennial      10 0.00128  0.916 0.00117
##  7 contested legislatur    41 0.00525  0.223 0.00117
##  8 contested republican    17 0.00218  0.511 0.00111
##  9 contested annual        16 0.00205  0.511 0.00105
## 10 contested septenni       5 0.000640 1.61  0.00103
## # … with 40 more rows
  • Since this gets rid of the most commonly used words by multiple authors, finding more subtle patterns is possible. The top 10 words for each author or multiple authors is much more unique to each author. It is now possible to see how certain words pertaining to judicial themes are more prevalent for Hamilton while words relating to legislative themes are more prevalent for Madison.

Question 5 [4 points]

Now create a document term matrix (DTM) only using the documents we know were written by Hamilton. As above, calculate the TF-IDF and then use this to estimate k-means clustering on these text data. To do so, start by “casting” the data via the cast_dtm() function. Then calculate the k-means analysis using k = 5 and then visualize the top 10 words per cluster. Can you interpret them? (Hint: use the tidy() function from the tidymodels package to help here.) NB: set nstart = 25 to ensure replicability!

# INSERT CODE HERE
dtm <- tokenize %>%
  filter(author == "hamilton") %>%
  count(document, word)


tfidf <- bind_tf_idf(dtm, word, document, n) 

tfidf %>%
  group_by(document) %>%
  arrange(-tf_idf) %>%
  slice(1:10)
## # A tibble: 510 × 6
## # Groups:   document [51]
##    document word          n      tf   idf  tf_idf
##       <int> <chr>     <int>   <dbl> <dbl>   <dbl>
##  1        1 offer         3 0.00530 1.99  0.0105 
##  2        1 wrong         2 0.00353 2.83  0.0100 
##  3        1 zeal          3 0.00530 1.85  0.00982
##  4        1 effici        2 0.00353 2.32  0.00821
##  5        1 forgotten     2 0.00353 2.14  0.00756
##  6        1 liberti       6 0.0106  0.713 0.00756
##  7        1 heart         2 0.00353 1.99  0.00702
##  8        1 reserv        2 0.00353 1.99  0.00702
##  9        1 discuss       4 0.00707 0.987 0.00698
## 10        1 angri         1 0.00177 3.93  0.00695
## # … with 500 more rows
cast <- cast_dtm(tfidf, document, word, tf_idf)

set.seed(123)

out <- kmeans(cast, centers = 5, nstart = 25)

kmTidy <- tidy(out) %>%
  gather(word,avg_tfidf, -size,-cluster,-withinss)

kmTidy %>%
  group_by(cluster) %>%
  arrange(-avg_tfidf) %>%
  slice(1:10) %>%
  ggplot(aes(x = avg_tfidf, y = reorder(word, avg_tfidf), fill = factor(cluster))) +
  geom_bar(stat = "identity") +
  facet_wrap(~cluster, scales = 'free') +
  labs(title = "Top 10 Words by Hamilton Measured by TF - IDF",
       subtitle = "Data from FederalistPaperCorpusTidy.rds",
       x = "TF - IDF",
       y = NULL,) +
  theme(legend.position = "top")

  • By dividing Hamilton’s work into clusters, it is possible to see more nuanced distribution statistics. For example, the third cluster presumably pertaining to more geographical location themes, indicates that Hamilton uses Northern more than Southern. This is especially interesting because it sheds insight into Hamilton’s most important concerns. Similarly, the second cluster seems to deal with the judicial themes and the fifth cluster deals with executive themes.

EXTRA CREDIT [4 points]

Re-do question 5 but on Madison instead of Hamilton. Do you notice any differences between the clusters among essays written by Hamilton versus those written by Madison?

# INSERT CODE HERE
dtm <- tokenize %>%
  filter(author == "madison") %>%
  count(document, word)


tfidf <- bind_tf_idf(dtm, word, document, n) 

tfidf %>%
  group_by(document) %>%
  arrange(-tf_idf) %>%
  slice(1:10)
## # A tibble: 150 × 6
## # Groups:   document [15]
##    document word           n      tf   idf  tf_idf
##       <int> <chr>      <int>   <dbl> <dbl>   <dbl>
##  1       10 faction       17 0.0158  1.10  0.0174 
##  2       10 parti         20 0.0186  0.916 0.0171 
##  3       10 major         12 0.0112  0.916 0.0102 
##  4       10 factious       4 0.00372 2.71  0.0101 
##  5       10 injustic       4 0.00372 2.71  0.0101 
##  6       10 cure           5 0.00466 2.01  0.00938
##  7       10 passion       10 0.00931 0.916 0.00853
##  8       10 properti       8 0.00745 1.10  0.00818
##  9       10 unabl          3 0.00279 2.71  0.00756
## 10       10 manufactur     4 0.00372 2.01  0.00750
## # … with 140 more rows
cast <- cast_dtm(tfidf, document, word, tf_idf)

set.seed(123)

out <- kmeans(cast, centers = 5, nstart = 25)

kmTidy <- tidy(out) %>%
  gather(word,avg_tfidf, -size,-cluster,-withinss)

kmTidy %>%
  group_by(cluster) %>%
  arrange(-avg_tfidf) %>%
  slice(1:10) %>%
  ggplot(aes(x = avg_tfidf, y = reorder(word, avg_tfidf), fill = factor(cluster))) +
  geom_bar(stat = "identity") +
  facet_wrap(~cluster, scales = 'free') +
  labs(title = "Top 10 Words by Madison Measured by TF - IDF",
       subtitle = "Data from FederalistPaperCorpusTidy.rds",
       x = "TF - IDF",
       y = NULL,) +
  theme(legend.position = "top")

  • Similar to the clusters of Hamilton’s work, Madison’s third cluster focuses on the judicial branch. Interestingly, Madison’s first cluster seems to focus on social divisions with factions and partitions which was not nearly as apparant in Hamilton’s clustering. Overall, Madison’s clusters are much broader in comparison to Hamilton’s more rigid clusters.